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Abstract. During the course of evolution, an organism's genome can 
undergo changes that affect the large-scale structure of the genome. These 
changes include gene gain, loss, duplication, chromosome fusion, fission, 
and rearrangement. When gene gain and loss occurs in addition to other 
types of rearrangement, breakpoints of rearrangement can exist that are 
only detectable by comparison of three or more genomes. An arbitrarily 
large number of these "hidden" breakpoints can exist among genomes that 
exhibit no rearrangements in pairwise comparisons. 

We present an extension of the multichromosomal breakpoint median prob- 
lem to genomes that have undergone gene gain and loss. We then demon- 
strate that the median distance among three genomes can be used to cal- 
culate a lower bound on the number of hidden breakpoints present. We 
provide an implementation of this calculation including the median dis- 
tance, along with some practical improvements on the time complexity of 
the underlying algorithm. 

We apply our approach to measure the abundance of hidden breakpoints 
in simulated data sets under a wide range of evolutionary scenarios. We 
demonstrate that in simulations the hidden breakpoint counts depend 
strongly on relative rates of inversion and gene gain/loss. Finally we apply 
current multiple genome aligners to the simulated genomes, and show that 
all aligners introduce a high degree of error in hidden breakpoint counts, 
and that this error grows with evolutionary distance in the simulation. Our 
results suggest that hidden breakpoint error may be pervasive in genome 
alignments. 



1 Introduction 

Genome rearrangement plays a fundamental role in biological processes includ- 
ing cancer [8], gene regulation, and development and a better understand- 
ing of genome rearrangement is expected to lend insight into these biological 
processes. The primary evidence for genome rearrangement in modern genomes 
comes from identifying rearrangement breakpoints in alignments of two or more 
genomes. Genome alignments and the rearrangement breakpoints they encode 
provide a basis for reconstructing genome rearrangement histories. The accurate 
reconstruction of rearrangement history depends intimately on whether genome 
alignments can accurately identify breakpoints. 
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When genomes have undergone gene loss in addition to rearrangement some 
breakpoints may only be detectable by comparison of three or more genomes. 
This class of breakpoints has received limited attention to date. We refer to 
such breakpoints as "hidden" breakpoints, since usage of these rearrangement 
breakpoints in an organism's evolutionary history is not apparent from pairwise 
comparison of available genomes. Hidden breakpoints occur in regions of the 
genome conserved in some, but not all of the genomes under comparison. A 
simplest possible example of hidden breakpoints involves the three genomes A, 
B, C, with blocks 1, 2, 3 (see Fig.[l]). 

By extending a recent solution to the breakpoint median problem |14j . we 
demonstrate that it is possible to calculate a hidden breakpoint count in three 
genomes when there is gain and loss. Our method is founded on the premise that 
a parsimonious reconstruction of the median genome would contain any sequence 
content present in at least two of the three genomes, while sequence content 
unique to a single genome would not be in the median. By comparing pairwise 
distances to the median distance, our method reveals some of the hidden break- 
points. Having implemented the method to calculate hidden breakpoint counts, 
we then conduct a simulation study to measure the abundance of hidden break- 
points under a range of different genome evolution parameters. We apply current 
multiple genome alignment systems to calculate alignments of the simulated se- 
quences. Finally, we count hidden breakpoints in the calculated alignments and 
compare with the true number of hidden breakpoints in the simulated sequences. 
We show that all tested genome alignment algorithms introduce error in hidden 
breakpoint estimation and characterize their error rates under a range of evolu- 
tionary parameters in simulation. 

2 Pairwise and Hidden Breakpoints 

Genome alignments consist of sequence regions that are aligned in two or more 
genomes. Each aligned sequence region, often called a synteny block or locally 
collinear block, is internally free from any rearrangement but may contain gaps. 
Here we simply call these regions blocks. We assign an integer-valued identifier to 
each block of an alignment, using the sign of the integer to represent whether the 
block occurs in a genome in the forward or in reverse orientation. In this way, the 
arrangement of a single genome g in a genome alignment can be represented as a 
sequence of signed integers: g — b\ 62 ... b n , h € Z. Note that this sequence of 
integers does not necessarily represent all parts of the genome, i. e. there may be 
unaligned, unique segments in-between blocks. Genomes themselves may consist 
of one or multiple chromosomes with either linear or circular topology. When a 
genome contains multiple chromosomes, it cannot be represented as a single se- 
quence of integers, but rather as a set of sequences: g — {bi . . . bi, bj+i . . . bi + j, . . .}. 
Following previous literature we define linear chromosomes to be bounded by zero 
length telomeres, acknowledging the discrepancy between this definition and its 
use in the biological literature to describe a string of nucleotides of length > 
near the ends of a linear chromosome. We use • to denote telomeres in a linear 
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Fig. 1. Multiple alignment (left) and possible evolutionary scenario (right) of genomes 
A, B, and C containing a rearrangement not detectable in pairwise comparison. In 
all three pairwise projections two of the blocks become unique leading to a breakpoint 
distance of 0. 



chromosome: • b\ ... b n •. 

In the present work we investigate rearrangement breakpoints in groups of 
two or three genomes, although an alignment might contain many more genomes 
gi , . . . , g m . This requires us to define a projection of an alignment to those blocks 
present in the current set of genomes. For three genomes g x , g y , and g z with 
x,y,z £ {1, . . . , m} we denote the projection of the alignment as n xyz , and a 
genome in the projected alignment by TT xyz (g). We define the projected genome 
Ttxyz(gx) as the subsequence of g x containing those blocks present in at least two 
of the genomes used for projection. We thus remove blocks that are unique to 
9xi 9yi or 9z from consideration in the three-way comparison since they can not 
reveal direct evidence of rearrangement breakpoints. 

Two blocks bi, bj that are consecutive within a genome (possibly with unique 
segments in-between) define an adjacency. In a pairwise projection of a genome 
alignment, a pairwise breakpoint is an adjacency from one genome that is missing 
in the other. By counting breakpoints in one genome g x relative to another 
g y we can calculate the breakpoint distance d{g x ,g y ). Breakpoints at telomeres 
count only | in d(g x ,g y ), as only breakage and no fusion to another loose end 
has happened. When g x and g y have equal content d(g x ,g y ) is equivalent to the 
classic breakpoint distance. If different sets of blocks are present in g x and g y , the 
breakpoint distance may be asymmetric, i.e. d(g x ,g y ) ^ d(g y ,g x ). For brevity 
we define d xyz (g Xl g y ) := d{-K xyz (g x ), n xyz (g y )). 

Previous work has applied pairwise breakpoint distance in a sum-of-pairs score 
for multiple genome alignment 4J. However, multiple alignments may reveal 
rearrangements that are not visible in pairwise projections (see example in Fig. [I]), 
information that sum-of-pairs scores do not capture. The breakage events of these 
rearrangements are hidden by gain or loss of rearranged blocks or by breakpoint 
re-use. We thus use the term hidden breakpoint to refer to this class of breakpoints. 

3 The Median Approach for Counting Breakpoints 

We now describe a method to compute a hidden breakpoint count T-L for three 
genomes. For alignments with m > 3 genomes, T-L can be calculated for all 
projections to three genomes. 
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A median genome: 




Fig. 2. Alignment of three genomes and a corresponding median genome. The pairwise 
distances to the median M are M) — 1.5, d(g2, M) = 0, and d(g3, M) = 0.5. 



A median M for three genomes g x , g y , g z with x,y, z € {1, . . . , m} is a genome 
formed on all blocks in ir xyz . The blocks in M are arranged such that the sum 

du ■= X! d xyz (g k ,M) 

is minimized (see Fig. [2] for an example). This represents a generalization of 
previous definitions of a median genome, e.g. |14j . to include all blocks in at 
least two out of three genomes g x ,g y ,g z - Previous breakpoint median methods 
required blocks to be conserved among all genomes. As we describe in Section [3.3[ 
removing the portion of dm attributed to pairwise breakpoints yields a hidden 
breakpoint count. We note that our method only requires computation of du 
and does not depend on accurate reconstruction of the actual median genome, 
for which many possibilities with minimal du may exist. We also note that 
duplications are forbidden; each block may occur at most once per genome. 

We compute the median distance dyi by applying a recent polynomial time 
solution |14j to our generalization of the median definition. Following the proof 



of Theorem 1 in |14j . we construct a graph from the genomes (see Sect. 3.1 for 
details) , and compute a maximum weight perfect matching which corresponds to 
a median genome [14] . Because counting hidden breakpoints requires only the 
median distance du and not the actual median, we are able to improve the time 
complexity of the solution from |14j by reducing the number of graph edges (see 



Sect. 3.2) input to the perfect matching. Note that this reduction eliminates some 



valid medians, but the median distance d^ is unchanged. 



3.1 Graph Construction. 

We first describe the graph as used in [T3] to compute a median genome. Figure[3] 
(left) shows an example graph for the alignment from Fig. [2] Given a genome 
alignment, the graph G = (VUT, Ey U Et U Ejj) contains four vertices from two 
different vertex sets V and T for each block bi of the alignment: A head vertex 
v"? € V representing the start of the block, a tail vertex v\ g V representing the 
end of the block, a head telomere vertex t\ G T, and a tail telomere vertex t\ G T . 
The telomere vertices t\ and t\ represent the possibilities that chromosomes start 
or end with block bi. It is necessary to have separate telomere vertices for each 
block to be able to find all (possibly multichromosomal) medians by the perfect 
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Fig. 3. Full graph (left) and a simplified graph (right) for the example alignment from 
Fig. [2] The simplified graph is sufficient to calculate the median distance djvr- Line width 
indicates edge weights. Dotted edges have a weight of 0. Red edges form a maximum 
weight perfect matching that corresponds to the median genome shown in Fig. [2] 



matching computation. All vertices from V are connected among each other by 
edges ey € Ey , and all vertices from T are connected among each other by edges 
&t G Exi i-e. the two subgraphs are complete. In addition, each head vertex v\ 
is connected to its corresponding head telomere vertex t^, and each tail vertex 
v\ to its corresponding tail telomere vertex t\ by edges eu £ Ejj. Initially, all 
edges have a weight of 0. We increase edge weights according to adjacencies in 
the genomes. For telomere adjacencies, we increase weights of edges in Ejj by h 
following [14]. For all other adjacencies we increase weights of edges from Ey by 
1. Depending on the orientation (sign) of the adjacent blocks, we increase the 
weight of the edge between a tail and head vertex, tail and tail, head and tail, or 
head and head vertex by 1. All edges in Et keep a weight of 0. 

3.2 Reducing the Number of Edges. 

Having constructed the graph, we compute a maximum weight perfect matching. 
The running time of the most efficient maximum weight perfect matching algo- 
rithms (e. g. depends on the number of edges in the graph, which in our case 
is quadratic in the number of blocks. In the following we show how it is possible 
to reduce the number of edges by a linear factor. 

A perfect matching is a subset of the graph's edges Em Q Ey U Et U Ejj such 
that each vertex of the graph is incident to exactly one edge in the matching. A 
maximum weight perfect matching is the perfect matching that maximizes the 
sum of weights of all edges e € Em ■ Figure [3| (le ft) demonstrates that many of the 
edges in a graph constructed as described in |3.1| have a weight of 0, and hence, will 
not contribute to the weight of a maximum weight perfect matching. Still, some 
of those edges are necessary to allow a perfect matching of maximum weight, but 
not all of them. In the following paragraph we explain why edges with a weight of 
in Ey between vertices vf and v?, with p,q G {h, t} and with bi,bj being blocks, 
can be removed from the graph as well as corresponding edges in Et between 
vertices and without affecting the weight of the resulting maximum weight 
perfect matching. 

A graph that was constructed as described above can have multiple maximum 
weight perfect matchings. Given any of these matchings, we can replace the set of 
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edges Em H Et by those edges between vertices t? , tj G T, for which the matching 
contains an edge between vf,Vj G V. This replacement docs not violate the 
matching property and does not affect the weight of the matching since all edges 
in Et have a weight of 0. The matching may then contain some pairs of 0- weight 
edges from Ey and Et connecting vertices v?,Vj G V and connecting vertices 
ti,tj G T. We can replace such pairs of edges by edges from E\j between vf and 
and between vj and tj, again without violating the matching property and 
affecting the weight of the matching. After this replacement none of the 0-weight 
edges from Ey and the corresponding edges from Et are part of the matching. 
Therefore, a maximum weight perfect matching exists that does not use these 
pairs of edges, and we can remove them from the initial graph without affecting 
the weight of the resulting maximum weight perfect matching. 

3.3 Hidden Breakpoint Counts. 

A maximum weight perfect matching on the above described graph assigns each 
head or tail vertex to its telomere vertex or to another head or tail vertex by 
edges. These edges correspond to the adjacencies of a median genome. The edges 
that are not part of the matching but have a weight greater than 0, correspond 
to breakpoints among the median and one or more other genomes. Thus, by 
calculating the weight difference of the original graph and the matching, we obtain 
the total number of breakpoints (1m between the three original genomes g x , g yi 
and g z and the median genome M. However, cIm overestimates the actual number 
of breakpoints among the three genomes. The median genome may contain blocks 
that are not present in one of the genomes, which disrupt adjacencies of blocks that 
occur in the same order without rearrangement. These disrupted adjacencies are 
neither pairwise nor hidden rearrangement breakpoints. Therefore, we calculate 
a separate hidden breakpoint count 

n = d M - I ■ 

The second term / removes the fraction of cLm which is due to breakpoints observ- 
able in pairs of genomes including those disrupted adjacencies caused by unequal 
block content. / is defined as: 

f =\ d xyz(9a, M ab ) + d xyz {g b , M ab ) . 

a<b£{x,y,z} 

For the computation of / we use a median M xy of two genomes g Xl g v , which is 
formed on all blocks of the triplet projection n xyz . We can compute M xy using the 
same approach as for a median of three genomes, i. e. construction of the graph 
and computation of a maximum weight perfect matching. The multiplication by \ 
is required because the distance from each genome to a median is counted exactly 
twice (to different medians). 

In the results section we discuss hidden breakpoints in alignments contain- 
ing arbitrarily many genomes. In alignments with more than three genomes, wc 
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calculate H for all projections to three genomes and report the sum over all pro- 
jections. Likewise, for pairwise breakpoints we report the sum over all breakpoints 
in pairwise projections. 

4 Simulating Evolution 

We simulated genome evolution to generate sets of evolved sequences and genome 
alignments relating them using the previously described method sgEvolver [3]. 
Briefly, sgEvolver applies the standard Markov process model of evolution (spe- 
cifically a marked Poisson process), with mutation events including substitutions, 
insertions and deletions of three different size distributions, and inversions. In 
our simulation each of the mutation types has the following properties: 

Nucleotide substitutions follow an HKY process with a transition/transver- 
sion ratio of 4 and background nucleotide frequencies A = 0.265, C = 0.235, 
T = 0.265, G = 0.235, as implemented in the program seq-gen [15]. Gamma- 
distributed heterogeneity in substitution rates across sites was simulated 
with the shape parameter k = 1. 

Indels comprise the first class of insertions and deletions. Indel lengths follow 
a Poisson distribution with A = 3. 

Small gain/loss events comprise the second insertion/deletion class and have 
lengths geometrically distributed with p — 200. 

Large gain/loss events comprise the final size class of insertions and deletions 
and have lengths uniformly distributed between 10000 and 50000. 

Inversions follow a geometric length distribution with p = 50000. 

These events were simulated on a phylogeny containing nine extant taxa. All 
insertion/deletion events are simulated to have uniformly random distributed po- 
sitions in the genome. The rate of each event type is specified by a relative rate 
parameter; the expected event count per site on a particular branch of a phylogeny 
is the product of branch length and relative rate. The ancestral genome in our 
simulations is 1 Mbp in size, and since insertions and deletions occur with equal 
probability, the expected genome size remains constant throughout the simulation 
process (though the variance grows). 

The approach used by sgEvolver to simulate insertions can also introduce 
duplications even though they are not directly modeled. In order to simulate 
insertions, sgEvolver maintains a finite pool of "donor" genomic material. When 
an insertion occurs, the sequence to insert is sampled uniformly at random from 
the donor material pool. This approach is intended to model the total pool of gene 
content available in a population, sometimes called a pan-genome [10] . When the 
simulation includes a large enough number of insertion events, especially "large 
gain/loss", it becomes likely that the same material from the donor pool will 
be inserted more than once, effectively creating a duplication in the simulated 
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genome. sgEvolver does not keep track of the donor pool positions for insertion, 
and therefore the simulated genome alignments ignore possible duplications. 

When simulating 1 Mbp genomes, the simulation and subsequent genome align- 
ment process is fast enough that we can analyze a large range of parameters in 
parallel on a compute cluster. We report results on various parameter ranges 
below. 



5 Hidden Breakpoints in Simulated Data 

In this section we examine the effect of inversion, gain/loss, and nucleotide sub- 
stitution rate on pairwise and hidden breakpoints. We expect hidden breakpoints 
to be driven by inversion and gain/loss, while nucleotide substitution may affect 
breakpoint counts in calculated alignments by increasing the difficulty of accurate 
alignment. 

We describe two simulated data sets below, InvNt and InvGL, consisting of 400 
and 200 genome alignments respectively. On each alignment of the datasets we 
computed the true number of pairwise breakpoints and hidden triplet breakpoints 



(see Sect. 5.1 ). Then in Sect. 5.3 we present calculated alignments of the simulated 
sequences done with three different genome alignment systems. We discuss how 
calculated alignments generally overestimate breakpoint counts and relate this 
error to evolutionary rates and other measures of alignment accuracy. 



5.1 True Breakpoint Counts in Simulated Genome Alignments. 

The InvNt data set enables examination of the influence of the inversion and 
nucleotide substitution rate on the breakpoint counts. We expect nucleotide 
substitutions to have no effect on breakpoint counts in simulated alignments, 
although as we discuss later in calculated alignments a high substitution rate af- 
fects estimated breakpoint counts by making alignment difficult. We simulated 
400 alignments with inversion rates between and 2 x 10~ 4 in steps of 1 x 10~ 5 
and nucleotide substitution rates between and 1 in steps of 0.05. All three in- 
sertion/deletion rates were fixed: indels at 1 x 10~ 3 , small gain/loss at 5 x 10 -4 , 
and large gain/loss at 2 x 10~ 5 . Previous work has demonstrated that alignment 
algorithms can reconstruct highly accurate genome alignments at these relatively 
low insertion/deletion rates [I]. In the alignments we observe to 3100 pairwise 
breakpoints and to 1150 hidden triplet breakpoints (see Fig.[4j\) . Despite these 
large ranges, the ratio between pairwise and hidden breakpoints stays constant. 
The coloring in Fig. [4]^ (top) shows that with growing inversion rates, we obtain 
more breakpoints, both pairwise and hidden triplet breakpoints. As one would 
expect, the nucleotide substitution rate has no direct effect on the number of 
breakpoints (Fig.[4]4_, bottom). 

In the InvGL data set we examine breakpoint counts for different large gain/loss 
and different inversion rates. InvGL consists of 200 alignments with large gain/loss 
rates between and 1 x 10 -4 step size 1 x 10~ 5 and again inversion rates between 
and 2 x 10~ 4 in steps of 1 x 10~ 5 . The other two insertion/deletion rates were 
fixed at the same values as above, and the nucleotide substitution rate at 0.01. As 
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with InvNt, we observe more pairwise and hidden triplet breakpoints for grow- 
ing inversion rates (Fig.[4p, top), but in contrast to InvNt, the ratio between 
pairwise and hidden breakpoint counts varies. The coloring in Fig.[4jl> (bottom) 
demonstrates that this variation is due to different gain/loss rates: With growing 
gain/loss rate, we observe more hidden triplet breakpoints whereas the number 
of pairwise breakpoints goes down slightly. This result is consistent with our 
definition of hidden breakpoints. The slight decrease in pairwise breakpoints at 
higher gain/loss rates is due to removal of breakpoints by gene loss. The hidden 
breakpoint count is also affected by gene loss in the InvGL simulation, reaching 
a maximum at a gain/loss rate of about 0.8. 

5.2 Tested Alignment Programs and Evaluation Metrics. 

Having established the true counts of hidden triplet breakpoints under a range 
of evolutionary scenarios, we continue by evaluating whether some current algo- 
rithms can infer genome alignments with the correct number of hidden break- 
points. The algorithm to calculate hidden triplet breakpoints requires positional 
homology alignments, the method cannot be applied to genome alignments con- 
taining paralog alignments. For this reason we selected and applied three current 
programs that generate positional homology genome alignments: TBA, progres- 
siveMauve, and Mugsy. 

The Threaded Blockset Aligner (TBA) [2 constructs multiple genome 
alignments by a process of merging and filtering pairwise alignments generated 
by BLASTZ [T3]. TBA requires specification of a phylogenetic guide tree, and 
for this we gave it the same topology used for simulation. We used TBA version 
2009-Jan-21, the most recent version publicly available at the time of this work. 
The progressiveMauve genome alignment algorithm [1] begins by identifying 
approximate multi-matches among input genomes, then progressively grouping 
these into locally collinear blocks and aligning nucleotides within these blocks. 
For grouping matches progressiveMauve applies a sum-of-pairs breakpoint scor- 
ing scheme, which penalizes pairwise breakpoints. In the present work we used 
progressiveMauve version 2011-02-02 with default options. Mugsy is a newer 
aligner [T] that also uses three steps as progressiveMauve, but differs in the de- 
tails of each step. For example, it uses a min-cut max-flow algorithm to partition 
the multi-matches into blocks under a synteny score. We used Mugsy vlr2.2 with 
default options. 

For each alignment from the three aligners we computed the sum of hidden 
triplet breakpoint counts H. To assess the error that aligners introduce to align- 
ments in terms of breakpoints, we calculated the difference between the number 
of breakpoints in calculated alignments and the number of breakpoints in the true 
alignments. We also calculated a score that measures nucleotide alignment accu- 
racy independently of genome arrangement: We compared each pair of aligned 
nucleotides in the calculated alignments with the true alignment to obtain pre- 
cision and recall of nucleotide positional homology prediction, and combined the 
two values to an Fi score. 
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Fig. 4. Hidden triplet breakpoint counts in simulated alignments with different inversion 
and nucleotide substitution rates (InvNt, A-C), and 200 simulated alignments with 
different inversion and large gain/loss rates (InvGL, D-F). 
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5.3 Breakpoint Counts in Calculated Genome Alignments. 

We computed alignments with TBA, progressiveMauve, and Mugsy on all 600 sets 
of nine evolved genomes in the two data sets InvNt and InvGL. Figures [4)3 and|4p 
show hidden triplet breakpoint counts of the alignments from InvNt, and Fig.[4ji] 
and|4^ for InvGL. In Fig.[4j3 and[4ji!, we display the median of the hidden triplet 
breakpoint counts per program for alignments simulated with equal inversion (Hp) 
or gain/loss rate (|4j3) as lines. 

In alignments calculated on InvNt we found that all three programs over- 
estimate the hidden breakpoint counts by a constant value independent of the 
inversion rate, suggesting a relationship with the constant gain/loss rate in this 
dataset. The two aligners that use an algorithm designed to minimize pairwise 
breakpoints introduce fewer additional hidden triplet breakpoints than TBA. The 
Fi score of most alignments is very good (> 0.95) except for some Mugsy align- 
ments. We recognize a trend that alignments with Fx scores closer to 1 have 
lower hidden triplet breakpoint errors. An exception are Mugsy alignments on 
sequences simulated with nucleotide substitution rates greater than 0.4. These 
Mugsy alignments have a very low recall suggesting the simulated genomes were 
too divergent to be aligned by Mugsy with its default settings. 

Calculated alignments of InvGL data show that the amount of gene gain/loss 
has a strong effect on the hidden triplet breakpoint error. The error is small at low 
rates of gain/loss, but grows quickly with the gain/loss rate. In Figure]^ we see 
that accurate alignments as measured by Fi have lower hidden triplet breakpoint 
error. We find that the Fi score drops with increasing gain/loss rates (data not 
shown), suggesting that genomes with high rates of simulated gain/loss may be 
generally difficult to align. 

6 Conclusions 

We have introduced the concept of hidden breakpoints in genome alignments 
and presented a method to calculate hidden triplet breakpoint counts. We have 
examined hidden breakpoint counts in simulated alignments evolved over a wide 
range of evolutionary parameters, and in alignments calculated by three different 
genome alignment algorithms on the same simulated datasets. We find that all 
tested genome aligners introduce a high degree of error in hidden breakpoint 
counts compared to the true count, and that this error grows with evolutionary 
distance in the simulation. Our results suggest that hidden breakpoint error may 
be pervasive in genome alignments. The error may in turn lead to erroneous 
inference of ancestral genomes and rearrangement history. Therefore, studies of 
the relationship between genome rearrangement history and biological processes 
could be improved by considering this error during the computation of genome 
alignments. 

Overall, the error in triplet breakpoint counts is of the same order of magni- 
tude as the pairwise breakpoint error. Methods such as the sum-of-pairs break- 
point score in progressiveMauve reduce the pairwise breakpoint error, but may 
not effectively reduce hidden breakpoint error. The max-flow min-cut algorithm 
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employed by Mugsy appears to yield the most precise block and breakpoint esti- 
mates, but only when substitution rates are low enough. It is possible that the use 
of a more sensitive matching method such as promer [6J or another approximate 
matching approach would enable Mugsy to produce high accuracy results also 
at higher levels of sequence divergence. Future alignment systems would benefit 
from approaches that either explicitly or implicitly consider hidden breakpoints. 

Limitations. The presented concept of hidden breakpoints has some limita- 
tions. First, the pairwise breakpoint distance is a lower bound for the number of 
breakage events that happened during evolution because several rearrangement 
events may use the same breakpoint. The hidden breakpoint count described 
here improves the estimate of breakage events, but it remains a lower bound to 
the actual number of breakage events that happened during evolution. A further 
limitation is that the current approach does not directly give rise to a method for 
phylogenetic inference on genome arrangement. Others have demonstrated prac- 
tical algorithms for phylogenetic inference on genome arrangement using DCJ 
medians with gene gain and loss [TB]. Although our median algorithm could be 
used in such a context, there is no constraint on the number or topology of chro- 
mosomes in median genomes and this might yield inferred ancestral genomes that 
are extremely biologically unlikely. Another limitation is the perfect-matching 
based median approaches cannot handle duplicated blocks in the alignments. Fi- 
nally, we have only derived a formula to calculate hidden breakpoints among three 
genomes but some hidden breakpoints may be identifiable only among 4 or more 
genomes. We leave identification of hidden breakpoints in 4 or more genomes, 
and analysis of genomes with duplications as future work. 
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